Client Report - Late Flights & Missing Data (JSON)

Course DS 250

Author

[STUDENT NAME]

Show the code
import pandas as pd
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)
Show the code
# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")

Elevator pitch

A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)

A Client has requested this analysis and this is your one shot of what you would say to your boss in a 2 min elevator ride before he takes your report and hands it to the client.

QUESTION|TASK 1

Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__

Here is one example of an item that has a couple of null values

Show the code
df.replace([-999, "", " "], np.nan, inplace=True)
# Convert to correct datatype
df = df.convert_dtypes()

row = df.iloc[0]

empty_values = row[row.isna()]

# Display in json
example_record = empty_values.to_json(indent=4)
print(example_record)
{
    "num_of_delays_late_aircraft":null
}

QUESTION|TASK 2

Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

I chose the metric of average delay hours. This I feel like is usually what determines a bad airport

Show the code
# Include and execute your code here
# Convert the delays columns to numeric (handle "1500+" as NaN for simplicity)
df["num_of_delays_carrier"] = pd.to_numeric(df["num_of_delays_carrier"], errors='coerce')
df["num_of_delays_late_aircraft"] = pd.to_numeric(df["num_of_delays_late_aircraft"], errors='coerce')
df["num_of_delays_nas"] = pd.to_numeric(df["num_of_delays_nas"], errors='coerce')
df["num_of_delays_security"] = pd.to_numeric(df["num_of_delays_security"], errors='coerce')
df["num_of_delays_weather"] = pd.to_numeric(df["num_of_delays_weather"], errors='coerce')

# Calculate the proportion of delayed flights for each airport
df["total_delays"] = df["num_of_delays_carrier"] + df["num_of_delays_late_aircraft"] + df["num_of_delays_nas"] + df["num_of_delays_security"] + df["num_of_delays_weather"]
df["proportion_delayed"] = df["total_delays"] / df["num_of_flights_total"]

# Calculate average delay time in hours
df["avg_delay_hours"] = df["minutes_delayed_total"] / 60 / df["num_of_flights_total"]

# Find the airport with the worst delays based on total proportion of delays
worst_airport = df.loc[df["proportion_delayed"].idxmax()]

# Display the summary table for each airport
summary_df = df[["airport_name", "num_of_flights_total", "total_delays", "proportion_delayed", "avg_delay_hours"]]
summary_df
airport_name num_of_flights_total total_delays proportion_delayed avg_delay_hours
0 Atlanta, GA: Hartsfield-Jackson Atlanta Intern... 35048 <NA> <NA> 0.221379
1 Denver, CO: Denver International 12687 3148 0.248128 0.240137
2 <NA> 12381 2432 0.19643 0.18157
3 Chicago, IL: Chicago O'Hare International 28194 9178 0.32553 0.377677
4 San Diego, CA: San Diego International 7283 1953 0.268159 0.209511
... ... ... ... ... ...
919 Washington, DC: Washington Dulles International 2799 443 0.158271 0.185566
920 Chicago, IL: Chicago O'Hare International 25568 4233 0.165559 0.200743
921 San Diego, CA: San Diego International 6231 1384 0.222115 0.195632
922 San Francisco, CA: San Francisco International 13833 4465 0.322779 0.433244
923 Salt Lake City, UT: Salt Lake City International 8804 1744 0.198092 0.203069

924 rows × 5 columns

QUESTION|TASK 3

What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)

I chose the porportion of delays. So basically the amount of delays compared to other months. From this September is the best moth to travel

Show the code
# Remove rows with missing "month"
df = df.dropna(subset=["month"])

# Aggregate delay statistics per month
month_delay_summary = df.groupby("month").agg({
    "num_of_delays_total": "sum",
    "num_of_flights_total": "sum"
}).reset_index()

# Calculate proportion of delays
month_delay_summary["delay_proportion"] = (
    month_delay_summary["num_of_delays_total"] / month_delay_summary["num_of_flights_total"]
)

# Order months correctly
months_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
]
month_delay_summary["month"] = pd.Categorical(month_delay_summary["month"], categories=months_order, ordered=True)

# Plot
p = ggplot(month_delay_summary, aes(x="month", y="delay_proportion")) + geom_bar(stat="identity") + ggtitle("Proportion of Delays by Month")
p.show()

QUESTION|TASK 4

According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:

a. 100% of delayed flights in the Weather category are due to weather  
a. 30% of all delayed flights in the Late-Arriving category are due to weather  
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%    

This is pretty cool. Combining these numbers gives us a more accurate representation of what is going on.

Show the code
# Include and execute your code here
# Replace missing values in Late Aircraft with the mean
df["num_of_delays_late_aircraft"] = df["num_of_delays_late_aircraft"].astype(float).fillna(df["num_of_delays_late_aircraft"].mean())

# Create a new column for total weather delays
df["weather_delays"] = df["num_of_delays_weather"]

# Add 30% of Late Aircraft delays as weather delays
df["weather_delays"] += 0.3 * df["num_of_delays_late_aircraft"]

# Add NAS delays as weather delays (40% from April to August, else 65%)
df["weather_delays"] += df.apply(
    lambda row: 0.4 * row["num_of_delays_nas"] if row["month"] in ["April", "May", "June", "July", "August"]
    else 0.65 * row["num_of_delays_nas"], axis=1
)

# Display the first 5 rows with the new "weather_delays" column
df.head()
airport_code airport_name month year num_of_flights_total num_of_delays_carrier num_of_delays_late_aircraft num_of_delays_nas num_of_delays_security num_of_delays_weather ... minutes_delayed_carrier minutes_delayed_late_aircraft minutes_delayed_nas minutes_delayed_security minutes_delayed_weather minutes_delayed_total total_delays proportion_delayed avg_delay_hours weather_delays
0 ATL Atlanta, GA: Hartsfield-Jackson Atlanta Intern... January 2005 35048 <NA> 1109.104072 4598 10 448 ... 116423 104415 207467 297 36931 465533 <NA> <NA> 0.221379 3769.431222
1 DEN Denver, CO: Denver International January 2005 12687 1041 928.000000 935 11 233 ... 53537 70301 36817 363 21779 182797 3148 0.248128 0.240137 1119.15
2 IAD <NA> January 2005 12381 414 1058.000000 895 4 61 ... <NA> 70919 35660 208 4497 134881 2432 0.19643 0.18157 960.15
3 ORD Chicago, IL: Chicago O'Hare International January 2005 28194 1197 2255.000000 5415 5 306 ... 88691 160811 364382 151 24859 638894 9178 0.32553 0.377677 4502.25
4 SAN San Diego, CA: San Diego International January 2005 7283 572 680.000000 638 7 56 ... 27436 38445 21127 218 4326 91552 1953 0.268159 0.209511 674.7

5 rows × 21 columns

QUESTION|TASK 5

Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.

San Francisco has more delays than the other airports. I believe this could be to the fog that always surronds this airport and its cool to see this represented in data.

Show the code
# Fix missing values in `num_of_delays_late_aircraft`
df["num_of_delays_late_aircraft"] = pd.to_numeric(df["num_of_delays_late_aircraft"], errors="coerce")
df["num_of_delays_late_aircraft"].fillna(df["num_of_delays_late_aircraft"].mean(), inplace=True)

# Convert num_of_delays_nas to numeric (handling any errors)
df["num_of_delays_nas"] = pd.to_numeric(df["num_of_delays_nas"], errors="coerce")

# Convert `Month` to numeric and handle missing values
month_mapping = {
    "January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6,
    "July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12
}
df["Month"] = df["month"].map(month_mapping)
df.dropna(subset=["Month"], inplace=True)  # Drop rows with missing months

# Compute total weather-related delays
df["total_weather_delays"] = (
    df["num_of_delays_weather"] +
    (0.30 * df["num_of_delays_late_aircraft"]) +
    np.where(df["Month"].between(4, 8), 0.40 * df["num_of_delays_nas"], 0.65 * df["num_of_delays_nas"])
)

# Calculate the proportion of total delays caused by weather
df["weather_delay_proportion"] = df["total_weather_delays"] / df["num_of_delays_total"]

# Remove rows where 'weather_delay_proportion' is NaN or infinite (caused by 0 total delays)
df_clean = df.dropna(subset=["weather_delay_proportion"])
df_clean = df_clean[df_clean["weather_delay_proportion"] != np.inf]

# Create the bar plot
p = ggplot(df_clean, aes(x="airport_code", y="weather_delay_proportion")) + \
    geom_bar(stat="identity") + \
    ggtitle("Proportion of Weather Delays Compared to Total Delays by Airport")

p.show()

STRETCH QUESTION|TASK 1

Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.

Its cool to see that weather is the largest factor in delays. Its quite incredibly that all the delays that are experienced are out of our control.

Show the code
# Include and execute your code here
# Calculate proportion of delays by type
df["carrier_delay_proportion"] = df["num_of_delays_carrier"] / df["num_of_flights_total"]
df["weather_delay_proportion"] = df["total_weather_delays"] / df["num_of_flights_total"]
df["security_delay_proportion"] = df["num_of_delays_security"] / df["num_of_flights_total"]

# Create a summary table
delay_summary = df.groupby("airport_code").agg({
    "carrier_delay_proportion": "mean",
    "weather_delay_proportion": "mean",
    "security_delay_proportion": "mean"
}).reset_index()

# Melt data for plotting
delay_summary_melted = delay_summary.melt(id_vars=["airport_code"], var_name="delay_type", value_name="proportion")

# Plot
p = ggplot(delay_summary_melted, aes(x="airport_code", y="proportion", fill="delay_type")) + \
    geom_bar(stat="identity", position="dodge") + \
    ggtitle("Comparison of Different Types of Delays by Airport")

p.show()